File deduplication #6332

Merged
merged 44 commits into from
Jan 21, 2025

Hocuri (Collaborator) commented Dec 11, 2024

When receiving messages, blobs will be deduplicated with the new function create_and_deduplicate_from_bytes(). For sending files, this adds a new function set_file_and_deduplicate() instead of deduplicating by default.

This is for #6265; read the issue description there for more details.
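
The idea of deduplicating on write can be sketched as follows. This is only an illustration under stated assumptions, not the code from this PR: `store_deduplicated` and `content_hash_hex` are hypothetical names, and the standard library's `DefaultHasher` (64 bits, not cryptographic, not stable across Rust releases) stands in for the real 64-hex-character hash so that the example needs no external crates.

```rust
use std::collections::hash_map::DefaultHasher;
use std::fs;
use std::hash::Hasher;
use std::io;
use std::path::{Path, PathBuf};

/// Stand-in for a cryptographic content hash (ASSUMPTION: a real
/// implementation would use a proper 256-bit hash; `DefaultHasher` is used
/// here only to keep the sketch dependency-free).
fn content_hash_hex(data: &[u8]) -> String {
    let mut hasher = DefaultHasher::new();
    hasher.write(data);
    // 64-bit value rendered as 16 lowercase hex characters.
    format!("{:016x}", hasher.finish())
}

/// Hypothetical helper: stores `data` in `blobdir` under a name derived from
/// its content. Returns the path and whether a write actually happened.
fn store_deduplicated(blobdir: &Path, data: &[u8]) -> io::Result<(PathBuf, bool)> {
    fs::create_dir_all(blobdir)?;
    let path = blobdir.join(content_hash_hex(data));
    if path.exists() {
        // Same name means same content (up to hash collisions), so the
        // existing file is reused and nothing is written.
        return Ok((path, false));
    }
    fs::write(&path, data)?;
    Ok((path, true))
}
```

Calling `store_deduplicated` twice with the same bytes returns the same path and writes only once; different bytes end up in different files.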

TODO:

  • Set files as read-only
  • Don't do a write when the file is already identical
  • The first 32 chars or so of the 64-character hash are enough. I calculated that if 10 billion people (i.e. all of humanity) use DC, each of them has 200k distinct blob files (I have 4k in my day-to-day account), and we used 20 chars, then the expected number of name collisions would be ~0.0002, and the probability that there is at least one name collision is lower than that (see footnote 1). I added 12 more characters to be on the super safe side, but this wouldn't be necessary and I could also make it 20 instead of 32.
    • Not 100% sure whether that's necessary at all - it would mainly be necessary if we might hit a length limit on some file systems (the blobdir is usually something like accounts/2ff9fc096d2f46b6832b24a1ed99c0d6/dc.db-blobs (53 chars), plus 64 chars for the filename would be 117, or 118 counting the path separator).
  • "touch" the files to prevent them from being deleted
  • TODOs in the code
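
The collision estimate in the TODO list above can be reproduced with a few lines of arithmetic. The sketch below assumes one particular reading that matches the ~0.0002 figure: names only need to be unique within a single user's blobdir, names are uniformly random hex strings, and the expected collision count is the number of file pairs per user divided by the number of possible names, summed over all users. The function name `expected_collisions` is hypothetical.

```rust
/// Expected number of filename collisions across all users, assuming
/// collisions only matter within one user's blobdir and names are
/// uniformly random strings of `hex_chars` hexadecimal characters.
fn expected_collisions(users: f64, files_per_user: f64, hex_chars: i32) -> f64 {
    // Number of unordered file pairs in one blobdir: n * (n - 1) / 2.
    let pairs_per_user = files_per_user * (files_per_user - 1.0) / 2.0;
    // Each hex character carries 4 bits, so there are 16^hex_chars names.
    let possible_names = 16f64.powi(hex_chars);
    users * pairs_per_user / possible_names
}
```

With 10 billion users and 200k files each, `expected_collisions(1e10, 2e5, 20)` evaluates to about 1.65e-4, which rounds to the ~0.0002 quoted above; with 32 characters the value drops below 1e-18. Since the probability of at least one collision can never exceed the expected number of collisions, these values are also upper bounds on that probability.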

For later PRs:

  • Replace BlobObject::create(…) with BlobObject::create_and_deduplicate(…) in order to deduplicate every time core creates a file
  • Modify JsonRPC to deduplicate blob files
  • Possibly rename BlobObject.name to BlobObject.file in order to prevent confusion (because name usually means "user-visible-name", not "name of the file on disk").

Footnotes

  1. Calculated with both https://printfn.github.io/fend/ and https://www.geogebra.org/calculator, both of which came to the same result (1, 2)

Hocuri force-pushed the hoc/file-deduplication branch from 20319b4 to 1fecd17 on January 19, 2025 19:55
Hocuri force-pushed the hoc/file-deduplication branch from 67e8651 to abeb3d7 on January 20, 2025 17:44
Hocuri force-pushed the hoc/file-deduplication branch from abeb3d7 to 1761804 on January 20, 2025 19:58
Hocuri requested a review from link2xt on January 21, 2025 13:09
Hocuri and others added 2 commits January 21, 2025 18:26
1. We wouldn't be able to easily revert it since we would have some
   read-only, some writeable files in the blobdir
2. It proved more complex than thought, with Windows having different
   behavior than others
3. It didn't really buy us anything, because Desktop has to copy the file to a temporary one in order to show it anyway, and on other platforms the user can't edit the blob files
Co-authored-by: l <link2xt@testrun.org>
Hocuri merged commit 65a9c4b into main on Jan 21, 2025
37 checks passed
Hocuri deleted the hoc/file-deduplication branch on January 21, 2025 18:42
3 participants